This Parallel Pivot represents a fundamental shift in computational philosophy: from temporal sequencing (executing one item after another) to spatial distribution (executing all operations simultaneously across the grid).
1. The Independence Heuristic
This is the golden rule of GPU computing: "If your problem can be phrased as 'apply some operation independently to each of N elements,' this is the mapping you should try first." This data-parallel approach is the lowest-hanging fruit of GPU acceleration: the overhead of thread management is far outweighed by the enormous throughput of parallel processing.
2. Precision and the Data Payload
HIP kernels typically process large arrays of primitive types. In high-performance graphics and machine learning we commonly use float (single precision, FP32), while scientific simulations that demand high numerical stability use double (double precision, FP64).
3. From Iteration to Occupation
In CPU code, the processor "visits" data through a loop. In GPU logic, the data "occupies" a thread. You no longer write how to loop; you write what a single worker at a specific coordinate should do.
$$\text{Index } i = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}$$
QUESTION 1
What is the primary heuristic for deciding if a problem is suitable for the 'Parallel Pivot'?
The problem requires complex recursion.
The problem involves applying an operation independently to N elements.
The problem must be solved in a strict temporal order.
The problem uses only integer arithmetic.
✅ Correct!
This is the 'Independence Heuristic': if elements don't depend on each other, they can be processed in parallel.
❌ Incorrect
Parallelism works best when tasks are independent; recursion and strict order usually hinder it.
QUESTION 2
In the context of the Parallel Pivot, what does the term 'Occupation' refer to?
The CPU visiting each index in a for-loop.
How many blocks are currently queued in the GPU.
Data 'occupying' a specific thread at a specific coordinate.
The percentage of memory used by the float arrays.
✅ Correct!
We shift from a single CPU 'visiting' data points to many data points 'occupying' threads simultaneously.
❌ Incorrect
Occupation focuses on the spatial distribution of work across the thread grid.
QUESTION 3
Which data types are most commonly handled by HIP kernels for high numerical stability in science?
bool and char
int and long
float and double
void and pointer
✅ Correct!
Float (FP32) is standard for speed, while double (FP64) is used for scientific stability.
❌ Incorrect
While pointers are used to access data, float and double represent the numerical 'payload' being processed.
QUESTION 4
When pivoting a loop into a kernel, what replaces the loop counter `i`?
The return value of the function.
A global thread identity calculated from grid/block dimensions.
The hipMalloc address.
The host-side iteration variable.
✅ Correct!
Each thread calculates its own index `i` based on its position in the grid.
❌ Incorrect
The host-side loop is removed entirely; the index is now derived from hardware coordinates.
QUESTION 5
Fill in the blank: To ensure production reliability even in basic kernels, you must ______.
Only use float types.
Add explicit error-checking macros everywhere.
Use a single thread per block.
Avoid all boundary checks.
✅ Correct!
Exactly! Defensive programming via error macros is mandatory in HIP development.
❌ Incorrect
Production code requires robust error handling to catch asynchronous failures.
Case Study: Vector Addition Decomposition
Mapping Sequential Logic to a 1D Grid
You are converting a CPU-based signal processing loop `for(int i=0; i<1000000; i++) { signal[i] *= 2.0; }` into a HIP kernel. The target device has Compute Units that prefer block sizes in powers of 2.
Q
Apply the Independence Heuristic: Why is this loop a candidate for the Parallel Pivot?
Solution:
The operation on `signal[i]` does not depend on `signal[i-1]` or any other element. Since each element can be processed independently, we can map the 1 million iterations to 1 million threads.
Q
If you use a block size of 256, what is the 'Occupation' logic needed within the kernel to handle the million elements?
Solution:
The kernel should first calculate the global ID: `int i = blockIdx.x * blockDim.x + threadIdx.x;`. Because 1,000,000 is not a perfect multiple of 256, a boundary check `if (i < 1000000)` is required to prevent out-of-bounds access by the 'overflow' threads in the final block.